Time Series Regression Model

1. GDP By Region

The GDP by Region (in million) data obtained from Statistics New Zealand is annually therefore it needs to be converted into monthly data. The process of converting the data into monthly is split into two parts : interpolate the annual data into quarterly and interpolate the converted quarterly data into monthly. The formula used for coverting the annual data into quarterly data is as follow:

\[\text{GDP}_{year,quarter} = \frac{\text{GDP}_{year + 1}}{4} + \frac{\Delta_{year + 1}\times \text{nb}}{10}, \hspace{.3in} \Delta_{year + 1} = \text{GDP}_{year + 1} - \text{GDP}_{year}\] where \(\text{nb} = 1,2,3,4\) corresponding to \(quarter = Q1,Q2,Q3,Q4\), respectively for each \(year = 2008,...,2018\).



The quarterly data can then be converted into monthly data using cubic interpolation (spline() function).

2. Seasonal Dummy

Figure 1. Monthly seasonal time series data

Figure 1. Monthly seasonal time series data

Family A frequently bought oranges in January, March, May, August and November but did not do so during the other months.

The aim is to forecast the 2020 bought oranges for Family A. We can model this data using a regression model with a linear trend and monthly dummy variables, \[y_t = \beta_o + \beta_1t + \beta_2m_{2,t} + \beta_3m_{3,t} + \dots + \beta_{12}m_{12,t} + \epsilon_t\] where \(\beta_1\) is the trend predictor and \(\beta_2m_{2,t},\dots,\beta_{12}m_{12,t}\) are the seasonal dummy predictors for 12 months. Notice that only eleven dummy variables are needed to code twelve categories. That is because the first category (in this case January) is captured by the intercept, and is specified when the dummy variables are all set to zero. In R the trend predictor is coded as trend and the seasonal dummy predictor is coded as season.

sumVal <- tapply(training, cycle(training), FUN = sum)
fit.reg <- tslm(training ~ trend + relevel(season, ref = which.min(sumVal)))
names(fit.reg$coefficients)[3:13] <- paste("month", substr(names(fit.reg$coefficients)[3:13],41,42))
summary(fit.reg)
## 
## Call:
## tslm(formula = training ~ trend + relevel(season, ref = which.min(sumVal)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.000  -3.917   0.000   8.083  26.667 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 20.75000   10.91432   1.901   0.0699 .
## trend       -0.19444    0.27346  -0.711   0.4842  
## month 1     10.11111   13.30698   0.760   0.4551  
## month 2     15.63889   13.26477   1.179   0.2505  
## month 3      1.16667   13.22807   0.088   0.9305  
## month 4     17.36111   13.19694   1.316   0.2013  
## month 5     -0.11111   13.17142  -0.008   0.9933  
## month 6      5.75000   13.15153   0.437   0.6660  
## month 7     -0.05556   13.13731  -0.004   0.9967  
## month 8     12.80556   13.12877   0.975   0.3395  
## month 10    27.19444   13.12877   2.071   0.0497 *
## month 11    11.38889   13.13731   0.867   0.3949  
## month 12    11.91667   13.15153   0.906   0.3743  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.08 on 23 degrees of freedom
## Multiple R-squared:  0.2923, Adjusted R-squared:  -0.07697 
## F-statistic: 0.7915 on 12 and 23 DF,  p-value: 0.6547

The p-value reported is the probability of the estimated \(\beta\) coefficient being as large as it is if there was no real relationship between the response variable and the corresponding predictor. In this case, no months is shown to have an effect on the number of oranges bought, implying seasonality is not significant.


Figure 2. Forecast of the predicted regression model

Figure 3. Time plot of Family A tre and predicted bought oranges

Figure 3. Time plot of Family A tre and predicted bought oranges

## 
##  Pearson's product-moment correlation
## 
## data:  training and fitted(fit.reg)
## t = 3.7472, df = 34, p-value = 0.0006641
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2578955 0.7380690
## sample estimates:
##       cor 
## 0.5406252

Figure 2 plots the actual (training) data versus the fitted data. If the predictions are close to the actual values,\(\text{R}^2\) is expected to be close to 1. The Pearson correlation test shows that the correlation between these variables is \(\text{R}^2\) = 0.54. In this case model does an alright job as it explains 54% of the variation in the Family A data.

Figure 4. Time plot of Tasman visitors and predicted Tasman visitors

Figure 4. Time plot of Tasman visitors and predicted Tasman visitors

A plot of the residuals against the fitted values should show no pattern. If a pattern is observed, there may be “heteroscedasticity” in the errors which means that the variance of the residuals may not be constant.

The residuals based on the residual plots are not showing any obvious patterns or trends, indicating constant variance.

Figure 5. Residuals of regression model including trend and seasonal dummy predictors

Figure 5. Residuals of regression model including trend and seasonal dummy predictors

The CV() (short for cross-validation statistic) function computes the CV, AIC, AICc and BIC measures of predictive accuracy for a linear model. For these measures, the model fits the data better with the lowest value of CV, AIC, AICc and BIC; the model fits the data better with the highest value for Adjusted \(\text{R}^2\). Note: This is useful when studying the effect of each predictor, but is not particularly useful for forecasting.

CV(fit.reg)
##           CV          AIC         AICc          BIC        AdjR2 
## 405.81432099 211.83827133 231.83827133 234.00753646  -0.07697186

This function’s purpose is to select the best predictors to use in a regression model when there are multiple predictors. Here the result shows that the seasonal dummy variable should be included in the model.


Other methods

1. STL-ETS model

The stlf() function decomposes the time series using STL, forecast the seasonally adjusted data (data without seasonality) and return the ‘reseasonalized’ forecasts (forecasts that take the seasonality into account). If the method argument is not specified, the function will use the ETS approach applied to the seasonally adjusted series.

STL <- stlf(training, biasadj = TRUE, h = 12)
STL$model$aic
## [1] 318.9962


2. NNAR model

The nnetar() function in the forecast package for R fits a Neural Network Model (NNAR) to a time series with lagged values of the time series as inputs (and possibly some other exogenous inputs). It is therefore a non-linear autogressive model, allowing complex non-linear relationships between the response variable and its predictors. The NNAR model for seasonal data has the form: \[NNAR(p,P,k)[m]\]

set.seed(2015)
nnetar.model <- nnetar(training)
NNAR <- forecast(nnetar.model, h = 12, biasadj = TRUE)
nnetar.model
## Series: training 
## Model:  NNAR(1,1,2)[12] 
## Call:   nnetar(y = training)
## 
## Average of 20 networks, each of which is
## a 2-2-1 network with 9 weights
## options were - linear output units 
## 
## sigma^2 estimated as 65.04

Since NNAR models usually (and in this case) have no underlying statistical model, calculating an AIC/BIC does not make sense here. A possible solution to select the best model is to fit various models to 90% of the data and use these models to forecast the remaining 10%, i.e., use a holdout sample. Choose the model that performs best on the holdout sample (“best” will depend on the error measure(s)). Refit this model based on the entire sample.


### 3. TBATS model Both the NNAR and TBATS models are mainly used for series exhibiting multiple complex seasonalities. TBATS is short for Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components:

  • T for trigonometric regressors to model multiple-seasonalities
  • B for Box-Cox transformations
  • A for ARMA errors
  • T for trend
  • S for seasonality

The TBATS model can be fitted using the tbats() command in the forecast package for R. The forecast function when running the TBATS model only returns the AIC value hence in this section we are comparing models using AIC. However AIC is not valid for neither NNAR or Combination (the combination is not really model but merely an average of all the methods’forecasts) thus these methods are going to be compared using RMSE.

TBATS <- forecast(tbats(training, biasadj = TRUE, use.box.cox = FALSE), h = 12)
TBATS$model$AIC
## [1] 329.7746


4. Forecast Combinations

An easy way to improve forecast accuracy is to use several different methods on the same time series, and to average the resulting forecasts. The forecasts used in this example are from the following models: ETS, ARIMA, STL-ETS, NNAR, and TBATS.

Combination <- (ARIMA[["mean"]] + ETS[["mean"]] + STL[["mean"]] + NNAR[["mean"]] + 
                  TBATS[["mean"]])/5

Though the Combination models performance is particularly well in this series, the NNAR model’s performance is shown to have the smallest RMSE error value.

##        ARIMA      ETS  STL-ETS    NNAR    TBATS Combination
## RMSE 18.5967 17.88111 23.48746 14.3574 18.23517    17.81031


Results

Table 4: True values vs. Predicted values for non-transformed data
year2018 TrueData ARIMA ETS Seasonal STL NNAR TBATS Combination
Jan 32 33 26 23 26 36 29 30
Feb 8 23 26 28 32 9 29 24
Mar 23 27 26 14 17 37 29 27
Apr 34 26 26 30 34 9 29 25
May 46 26 26 12 16 35 29 26
June 24 26 26 18 22 15 29 24
Jul 17 26 26 12 17 28 29 25
Aug 34 26 26 25 30 18 29 26
Sep 12 26 26 11 17 29 29 25
Oct 44 26 26 38 44 16 29 28
Nov 4 26 26 22 28 36 29 29
Dec 15 26 26 23 29 21 29 26


Overall, the best model for the non-transformed data is the ARIMA model, ARIMA(1,0,0)(0,1,0)[12] and the best model for the transformed data is the Combination model. This statement is roughly based on the difference between the model’s predicted values and the true value of the original data for the Tasman Glacier data.


Table 6: True values vs. Predicted values for transformed data
year2018 TrueData ARIMA ETS Seasonal STL NNAR TBATS Combination
Jan 32 33 26 23 26 36 29 30
Feb 8 23 26 28 32 9 29 24
Mar 23 27 26 14 17 37 29 27
Apr 34 26 26 30 34 9 29 25
May 46 26 26 12 16 35 29 26
June 24 26 26 18 22 15 29 24
Jul 17 26 26 12 17 28 29 25
Aug 34 26 26 25 30 18 29 26
Sep 12 26 26 11 17 29 29 25
Oct 44 26 26 38 44 16 29 28
Nov 4 26 26 22 28 36 29 29
Dec 15 26 26 23 29 21 29 26


To-Do List

  • Attempt R Shiny : write functions for applying ARIMA method on multiple DoC data sets (Track levels).
  • Review Dynamic Regression Models as another possible model.
  • Impose a constraint on the forecasts to ensure they stay within some specified range [a,b].